reasoning effort
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Matta, Shiho, Pereira, Lis Kanashiro, Han, Peitao, Cheng, Fei, Kitazawa, Shigeru
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
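To make the forward/backward protocol concrete, here is a minimal sketch of such an evaluation loop. The `query_vlm` stub and frame file names are placeholders, not AoT-PsyPhyBENCH's actual interface; a real run would pass sampled frames (or the clip itself) to the model under test.

```python
import random

# Hypothetical stand-in for a real VLM call; replace with the model under test.
def query_vlm(frames: list[str]) -> str:
    """Return 'forward' or 'backward' for an ordered list of frame paths."""
    return random.choice(["forward", "backward"])  # chance-level stub

def evaluate_aot(clips: list[list[str]], seed: int = 0) -> float:
    """Binary arrow-of-time protocol: each clip is shown either in its
    natural order or reversed, and the model must name the direction."""
    rng = random.Random(seed)
    correct = 0
    for frames in clips:
        reverse = rng.random() < 0.5              # hide the true direction
        shown = frames[::-1] if reverse else frames
        label = "backward" if reverse else "forward"
        correct += query_vlm(shown) == label
    return correct / len(clips)

if __name__ == "__main__":
    fake_clips = [[f"clip{c}_frame{i}.jpg" for i in range(8)] for c in range(50)]
    print(f"accuracy = {evaluate_aot(fake_clips):.2f} (chance = 0.50)")
```

Because the task is binary, chance accuracy is 0.5, which is why "near chance" is the benchmark's key negative result.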
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
Beger, Claas, Yi, Ryan, Fu, Shuhao, Moskvichev, Arseny, Tsai, Sarah W., Rajamanickam, Sivasankaran, Mitchell, Melanie
OpenAI's o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models' abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models' rules are often based on surface-level "shortcuts" and capture intended abstractions far less often than humans. Thus, their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models' output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models' abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.
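A hedged sketch of the dual evaluation described above: `TaskResult` and its `rule_label` values are illustrative names, not the paper's code, but they show how output accuracy and rule-level abstraction rates can diverge.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class TaskResult:
    """One model attempt on a ConceptARC-style task (illustrative fields only)."""
    output_correct: bool  # did the predicted grid match the target?
    rule_label: str       # human label for the stated rule: 'intended', 'shortcut', or 'incorrect'

def dual_evaluation(results: list[TaskResult]) -> dict[str, float]:
    """Report output accuracy alongside how often the stated rule reflects
    the intended abstraction, so the two measures can diverge."""
    n = len(results)
    rule_counts = Counter(r.rule_label for r in results)
    return {
        "output_accuracy": sum(r.output_correct for r in results) / n,
        "intended_rule_rate": rule_counts["intended"] / n,
        "shortcut_rule_rate": rule_counts["shortcut"] / n,
    }

if __name__ == "__main__":
    demo = [TaskResult(True, "shortcut"), TaskResult(True, "intended"),
            TaskResult(False, "intended"), TaskResult(False, "incorrect")]
    print(dual_evaluation(demo))  # accuracy 0.5, but only half the rules are 'intended'
```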
- North America > United States > New Mexico (0.04)
- North America > United States > Texas (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Health & Medicine (1.00)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)
Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
Zhao, James Xu, Hooi, Bryan, Ng, See-Kiong
Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that enabling thinking remains beneficial compared to the non-thinking setting. Code and data are available at https://github.com/XuZhao0/tts-knowledge
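The abstention-versus-recall distinction can be made concrete with a small scoring sketch; the response fields below are assumptions for illustration, not taken from the released code, and the demo numbers are made up.

```python
def score_responses(responses: list[dict]) -> dict[str, float]:
    """Each response has 'answer' (str, or None for an abstention) and
    'correct' (bool, meaningful only when an answer was attempted).
    Hallucination rate is measured over attempted answers, so abstaining
    lowers it without improving factual recall."""
    n = len(responses)
    attempted = [r for r in responses if r["answer"] is not None]
    correct = sum(r["correct"] for r in attempted)
    return {
        "accuracy": correct / n,
        "abstention_rate": (n - len(attempted)) / n,
        "hallucination_rate": (len(attempted) - correct) / max(len(attempted), 1),
    }

if __name__ == "__main__":
    low_effort = [{"answer": "A", "correct": True}, {"answer": "B", "correct": False},
                  {"answer": "C", "correct": False}, {"answer": "D", "correct": True}]
    high_effort = [{"answer": "A", "correct": True}, {"answer": None, "correct": False},
                   {"answer": None, "correct": False}, {"answer": "D", "correct": True}]
    print("low effort :", score_responses(low_effort))
    # Same accuracy, fewer hallucinations, purely because the model abstained more.
    print("high effort:", score_responses(high_effort))
```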
- North America > Canada > Ontario > Toronto (0.15)
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom (0.14)
Performance of GPT-5 Frontier Models in Ophthalmology Question Answering
Antaki, Fares, Mikhail, David, Milad, Daniel, Mammo, Danny A, Sharma, Sumit, Srivastava, Sunil K, Chen, Bing Yu, Touma, Samir, Sevgi, Mertcan, El-Khoury, Jonathan, Keane, Pearse A, Chen, Qingyu, Tham, Yih Chung, Duval, Renaud
Importance: Novel large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may enhance performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. Objective: To evaluate the performance and cost-accuracy trade-offs of OpenAI's GPT-5 compared to previous generation LLMs on ophthalmological question answering. Design, Setting, and Participants: In August 2025, 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) were evaluated alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the AAO Basic Clinical Science Course (BCSC) dataset. The study did not include human participants. Main Outcomes and Measures: The primary outcome was accuracy on the 260-item ophthalmology multiple-choice question set for each model configuration. Secondary outcomes included head-to-head ranking of configurations using a Bradley-Terry (BT) model applied to paired win/loss comparisons of answer accuracy, and evaluation of generated natural language rationales using a reference-anchored, pairwise LLM-as-a-judge framework. Additional analyses assessed the accuracy-cost trade-off by calculating mean per-question cost from token usage and identifying Pareto-efficient configurations. Results: The configuration GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985),
- North America > United States > Ohio > Cuyahoga County > Cleveland (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.47)
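One of the secondary analyses in the GPT-5 ophthalmology study above, identifying Pareto-efficient configurations on the cost-accuracy plane, is easy to sketch. The function below is a generic dominance check; the demo costs are placeholders (only the 0.965 accuracy figure comes from the abstract).

```python
def pareto_efficient(configs: dict[str, tuple[float, float]]) -> list[str]:
    """configs maps name -> (mean_cost_per_question, accuracy).
    A configuration is Pareto-efficient if no other configuration is at least
    as cheap and at least as accurate, with one of the two strictly better."""
    efficient = []
    for name, (cost, acc) in configs.items():
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for other, (c, a) in configs.items() if other != name
        )
        if not dominated:
            efficient.append(name)
    return efficient

if __name__ == "__main__":
    # Illustrative numbers only, not the paper's measurements.
    demo = {
        "gpt-5-high":     (0.20, 0.965),
        "gpt-5-medium":   (0.10, 0.950),
        "gpt-5-mini-low": (0.01, 0.900),
        "gpt-4o":         (0.05, 0.850),  # dominated: mini-low is cheaper and more accurate
    }
    print(pareto_efficient(demo))
```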
HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics
Luettgau, Lennart, Coppock, Harry, Dubois, Magda, Summerfield, Christopher, Ududec, Cozmin
As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.
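As a rough illustration of the modeling style HiBayES describes (not its actual implementation), here is a minimal multilevel Bayesian GLM in PyMC: per-evaluation effects partially pooled under a population-level prior, with a Bernoulli likelihood on item-level correctness. All data below are simulated.

```python
import numpy as np
import pymc as pm

# Fake item-level 0/1 correctness, deliberately low-data (15 items per evaluation).
rng = np.random.default_rng(0)
n_evals, n_per_eval = 5, 15
eval_idx = np.repeat(np.arange(n_evals), n_per_eval)
y = rng.binomial(1, 0.6, size=n_evals * n_per_eval)

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.5)          # population-level log-odds of a correct answer
    sigma = pm.HalfNormal("sigma", 1.0)     # spread of evaluation-level effects
    theta = pm.Normal("theta", mu, sigma, shape=n_evals)  # partially pooled per-evaluation effect
    pm.Bernoulli("obs", logit_p=theta[eval_idx], observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

# Pooled capability estimate comes with a full posterior, not just a point value.
print(idata.posterior["mu"].mean().item())
```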
- Research Report > Experimental Study (0.70)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
Advanced AI suffers 'complete accuracy collapse' in face of complex problems, study finds
Apple researchers have found "fundamental limitations" in cutting-edge artificial intelligence models, in a paper raising doubts about the technology industry's race to develop ever more powerful systems. Apple said in a paper published at the weekend that large reasoning models (LRMs) – an advanced form of AI – faced a "complete accuracy collapse" when presented with highly complex problems. It found that standard AI models outperformed LRMs in low-complexity tasks, while both types of model suffered "complete collapse" with high-complexity tasks. Large reasoning models attempt to solve complex queries by generating detailed thinking processes that break down the problem into smaller steps. The study, which tested the models' ability to solve puzzles, added that as LRMs neared performance collapse they began "reducing their reasoning effort".
Frog Soup: Zero-Shot, In-Context, and Sample-Efficient Frogger Agents
Li, Xiang, Hao, Yiyang, Fulop, Doug
RL game-playing agents are traditionally initialized with zero pre-existing knowledge about a specific game environment and learn to play the game through millions of interactions with the environment. Significant time and compute are often spent exploring states that will not be experienced under high-scoring policies. Exploration is particularly challenging in environments that require long-horizon action sequences and provide sparse rewards, such as the Atari games and real-world robotics challenges where the state space is too large to effectively sample through free-form exploration. In this paper we explore whether pretrained general RL agents like reasoning LLMs can play Atari games and investigate ways to leverage pretrained RL agents to reduce the training samples needed to train smaller agents from scratch. We first explore whether the contextual under…
- Workflow (0.91)
- Research Report > New Finding (0.67)
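A zero-shot LLM-as-agent loop of the kind the Frogger paper above investigates can be sketched generically; the `ToyEnv`, action set, and `llm_choose_action` stub below are hypothetical stand-ins, not the authors' actual setup.

```python
import random

ACTIONS = ["NOOP", "UP", "DOWN", "LEFT", "RIGHT"]

def llm_choose_action(state_description: str) -> str:
    """Placeholder for a call to a reasoning LLM; returns one action name."""
    return random.choice(ACTIONS)

def play_episode(env, max_steps: int = 200) -> float:
    """Generic interaction loop: describe state -> ask model -> step environment."""
    state, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        action = llm_choose_action(f"Frogger-like state: {state}")
        state, reward, done = env.step(ACTIONS.index(action))
        total_reward += reward
        if done:
            break
    return total_reward

class ToyEnv:
    """Stand-in environment with the usual reset/step interface."""
    def __init__(self): self.t = 0
    def reset(self): self.t = 0; return "start"
    def step(self, action): self.t += 1; return f"t={self.t}", 1.0, self.t >= 10

if __name__ == "__main__":
    print("episode return:", play_episode(ToyEnv()))
```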
Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs
Estermann, Benjamin, Wattenhofer, Roger
Large Language Models (LLMs) have demonstrated remarkable text generation capabilities, and recent advances in training paradigms have led to breakthroughs in their reasoning performance. In this work, we investigate how the reasoning effort of such models scales with problem complexity. We use the infinitely scalable Tents puzzle, which has a known linear-time solution, to analyze this scaling behavior. Our results show that reasoning effort scales with problem size, but only up to a critical problem complexity. Beyond this threshold, the reasoning effort does not continue to increase, and may even decrease. This observation highlights a critical limitation in the logical coherence of current LLMs as problem complexity increases, and underscores the need for strategies to improve reasoning scalability. Furthermore, our results reveal significant performance differences between current state-of-the-art reasoning models when faced with increasingly complex logical puzzles.
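The scaling measurement is essentially "reasoning tokens as a function of puzzle size". The sketch below assumes a hypothetical `count_reasoning_tokens` helper (here a stub with the qualitative shape the paper reports) and simply reports where effort stops growing.

```python
def count_reasoning_tokens(puzzle_size: int) -> int:
    """Stub: roughly linear growth in effort that saturates past a critical size."""
    return min(200 * puzzle_size, 2400)

def effort_plateau(sizes: list[int], tol: float = 0.05) -> int | None:
    """Return the first size at which effort stops increasing by more than tol."""
    efforts = [count_reasoning_tokens(s) for s in sizes]
    for prev, cur, size in zip(efforts, efforts[1:], sizes[1:]):
        if cur <= prev * (1 + tol):
            return size
    return None

if __name__ == "__main__":
    print("effort plateaus around size:", effort_plateau(list(range(2, 30))))
```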
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Yang, Wenkai, Ma, Shuming, Lin, Yankai, Wei, Furu
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current research continues to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model's reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with QwQ-32B-Preview.
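The selection step of the Thinking-Optimal Scaling strategy, as summarized above, reduces to "keep the shortest correct response across reasoning efforts". The field names below are assumptions for illustration, not the authors' data format.

```python
def shortest_correct_response(candidates: list[dict]) -> dict | None:
    """Among responses generated under different reasoning efforts, keep the
    shortest one that is correct; it becomes a self-improvement training example.
    Each candidate has 'text', 'num_tokens', and 'correct' fields (illustrative)."""
    correct = [c for c in candidates if c["correct"]]
    return min(correct, key=lambda c: c["num_tokens"]) if correct else None

if __name__ == "__main__":
    demo = [
        {"text": "long CoT ...",  "num_tokens": 900, "correct": True},
        {"text": "short CoT ...", "num_tokens": 220, "correct": True},
        {"text": "wrong ...",     "num_tokens": 150, "correct": False},
    ]
    print(shortest_correct_response(demo)["num_tokens"])  # -> 220
```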
- Asia > Middle East > Jordan (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > China (0.04)